Lecture’s plan

  1. How to do feature selection for text data?

  2. Is PCA a FS method for text?

  3. Other methods?

An illustration of the vector space (VS) model

  • All documents are projected into this concept space

Feature selection: What

You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. email spam classification)


The data has 10,000 fields (features)


You need to cut it down to 1,000 fields before you try machine learning. Which 1,000?


The process of choosing the 1,000 fields to use is called Feature Selection

Feature selection: Why

Why accuracy drops

  • Suppose the best feature set has 20 features.
  • If you add another 5 features, the accuracy of the learned model will often decrease.
  • But you still have the original 20 features!
  • Why does this happen?

Noise / Explosion

  • The additional features typically add noise. Machine learning will pick up on spurious correlations that may hold in the training set but not in the test set.

  • For some ML methods, more features means more parameters to learn (more NN weights, more decision tree nodes, etc…)

  • The increased space of possibilities is more difficult to search.

Feature selection

Why we need FS:

1. To improve performance (in terms of speed, predictive power, simplicity of the model).

2. To visualize the data for model selection.

3. To reduce dimensionality and remove noise.

Feature Selection is a process that chooses an optimal subset of features according to a certain criterion.

Feature selection for text

Feature selection is the process of selecting a specific subset of the terms of the training set and using only them in the classification algorithm.

  • high dimensionality of text features
  • Select the most informative features for model training
    • Reduce noise in feature representation
    • Improve final classification performance
    • Improve training/testing efficiency
      • Lower time complexity
      • Fewer training examples needed
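The filter-style selection described above can be sketched in a few lines: score every term in the corpus, then keep the top \(k\). This is a minimal sketch with illustrative names (`select_top_terms`, `doc_frequency`); documents are represented as sets of terms, and any scoring metric from the later slides can be plugged in.

```python
def select_top_terms(docs, k, score):
    """Rank every term in the corpus by `score` and keep the k best."""
    vocab = {t for doc in docs for t in doc}
    return sorted(vocab, key=lambda t: score(t, docs), reverse=True)[:k]

def doc_frequency(term, docs):
    """Number of documents that contain the term."""
    return sum(1 for doc in docs if term in doc)

docs = [{"free", "money", "now"}, {"meeting", "agenda"}, {"free", "offer"}]
top = select_top_terms(docs, 2, doc_frequency)  # "free" has the highest DF
```

Swapping `doc_frequency` for information gain, chi-square, or the gini-index changes only the `score` argument, not the pipeline.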

Feature Selection Methods

Feature selection methods

  • Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines.

Some criteria

The choice of a feature selection/ranking method depends on the nature of:
  • the variables and the target (binary, categorical, continuous)
  • the problem (dependencies between variables, linear/non-linear relationships between variables and the target)
  • the available data (number of examples and number of variables, noise in the data)
  • the available tabulated statistics

Filters, Wrappers, and Embedded methods

Wrapper Methods

Feature selection methods

  • Wrapper method
    • Find the best subset of features for a particular classification method

Feature selection: Wrappers

  • Optimizes for a specific learning algorithm
  • The feature subset selection algorithm is a “wrapper” around the learning algorithm
    1. Pick a feature subset and pass it to the learning algorithm
    2. Create training/test set based on the feature subset
    3. Train the learning algorithm with the training set
    4. Find accuracy (objective) with validation set
    5. Repeat for all feature subsets and pick the feature subset which led to the highest predictive accuracy (or other objective)
  • Basic approach is simple
  • Variations are based on how to select the feature subsets, since there are an exponential number of subsets

Feature selection: Wrappers

  • Exhaustive Search - Exhausting
  • Forward Search – \(O(n^2 \cdot \text{learning/testing time})\) - Greedy
    1. Score each feature by itself and add the best feature to the initially empty set FS (FS will be our final Feature Set)
    2. Try each subset consisting of the current FS plus one remaining feature and add the best feature to FS
    3. Continue until you stop getting significant improvement (over a window)
  • Backward Search – \(O(n^2 \cdot \text{learning/testing time})\) - Greedy
    1. Score the initial complete FS
    2. Try each subset consisting of the current FS minus one feature in FS and drop the feature from FS causing least decrease in accuracy
    3. Continue until you begin to get significant decreases in accuracy
  • Branch and Bound and other heuristic approaches available
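The greedy forward search above can be sketched as follows. This is a sketch, not a full wrapper: `evaluate` stands in for the train-then-validate loop (steps 2–4 of the wrapper procedure), and the toy objective below is purely illustrative.

```python
def forward_select(features, evaluate, min_gain=1e-3):
    """Greedy forward search: grow the feature set FS one best feature at a time."""
    fs, best = [], float("-inf")
    while True:
        remaining = [f for f in features if f not in fs]
        if not remaining:
            return fs
        # try FS plus each remaining feature, take the best
        score, f = max((evaluate(fs + [f]), f) for f in remaining)
        if fs and score - best < min_gain:  # stop when improvement stalls
            return fs
        fs.append(f)
        best = score

# toy objective: "a" and "b" are informative, every extra feature costs a little
value = lambda subset: len(set(subset) & {"a", "b"}) - 0.01 * len(subset)
chosen = forward_select(["a", "b", "c", "d"], value)
```

Backward search is the mirror image: start from the full set and drop the feature whose removal hurts `evaluate` least.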

Wrapper method

  • Wrapper method
    • Search in the whole space of feature groups
      • Sequential forward selection or genetic search to speed up the search
    • Consider all possible dependencies among the features
    • Impractical for text categorization
      • Cannot deal with large feature sets
      • An NP-hard search problem
        • No direct relation between feature subset selection and evaluation
Wrappers for feature selection

Search strategies

  • Exhaustive search.
  • Simulated annealing, genetic algorithms.
  • Beam search: keep the k best paths at each step.
  • Greedy search: forward selection or backward elimination.
  • PTA(l,r): plus \(l\), take away \(r\) – at each step, run SFS \(l\) times then SBS \(r\) times.
  • Floating search (SFFS and SBFS): one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far. At any time, if a better subset of the same size was already found, switch abruptly.

Filter Methods

Filter method

  • Filter method
    • Evaluate the features independently from the classifier and other features
      • No indication of a classifier’s performance on the selected features
      • No dependency among the features
    • Feasible for very large feature set

Document frequency

  • Rare words: non-influential for global prediction, reduce vocabulary size
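Document-frequency thresholding, the simplest filter, can be sketched directly: count in how many documents each term appears and drop terms below a cutoff. `prune_rare_terms` and `min_df` are illustrative names.

```python
from collections import Counter

def prune_rare_terms(docs, min_df=2):
    """Keep only terms whose document frequency reaches min_df."""
    df = Counter(t for doc in docs for t in set(doc))  # set(): count each doc once
    return {t for t, n in df.items() if n >= min_df}

docs = [["free", "offer", "free"], ["free", "lunch"], ["quarterly", "report"]]
vocab = prune_rare_terms(docs, min_df=2)  # only "free" appears in >= 2 documents
```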

Information gain

  • Decrease in entropy of categorical prediction when the feature is present or absent
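Information gain can be computed as the class entropy minus the entropy after splitting on presence/absence of the term. A minimal sketch (function names are illustrative, log base 2):

```python
import math

def entropy(labels):
    """H(C) over a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(term, docs, labels):
    """IG(t) = H(C) - p(t) H(C|t) - p(~t) H(C|~t)."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    ig = entropy(labels)
    for part in (present, absent):
        if part:
            ig -= len(part) / len(labels) * entropy(part)
    return ig

docs = [{"viagra"}, {"viagra"}, {"meeting"}, {"meeting"}]
labels = ["spam", "spam", "ham", "ham"]
ig = information_gain("viagra", docs, labels)  # perfectly informative: IG = 1 bit
```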

Gini index

Let \(p(c | t)\) be the conditional probability that a document belongs to class \(c\), given the fact that it contains the term \(t\). Therefore, we have:

\[\sum^k_{c=1}{p(c | t)} = 1\]

Then, the gini-index for the term \(t\), denoted by \(G(t)\) is defined as:

\[G(t) = \sum^k_{c=1}{p(c | t)^2}\]

Gini Index

  • The value of the gini-index lies in the range \([1/k, 1]\).

  • Higher values of the gini-index indicate a greater discriminative power of the term \(t\).

  • If the global class distribution is skewed, the gini-index may not accurately reflect the discriminative power of the underlying attributes.

  • ➔ normalized gini-index
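The gini-index \(G(t) = \sum_c p(c|t)^2\) can be computed directly from corpus counts; a minimal sketch (names are illustrative):

```python
def gini_index(term, docs, labels):
    """G(t) = sum over classes of p(c|t)^2, on the documents containing t."""
    hits = [y for d, y in zip(docs, labels) if term in d]
    if not hits:
        return 0.0
    return sum((hits.count(c) / len(hits)) ** 2 for c in set(hits))

docs = [{"sale"}, {"sale"}, {"sale", "agenda"}, {"agenda"}]
labels = ["spam", "spam", "ham", "ham"]
g_sale = gini_index("sale", docs, labels)      # spread over both classes
g_agenda = gini_index("agenda", docs, labels)  # concentrated in "ham": G = 1
```

The term concentrated in one class scores 1 (maximum discriminative power); a term spread evenly over \(k\) classes would score \(1/k\).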

Feature scoring metrics

  • \({\chi}^2\) statistics with multiple categories

    • \({\chi}^2=\sum_c{p(c) {\chi}^2(c,t)}\)

      • Expectation of \({\chi}^2\) over all the categories

    • \({\chi}^2(t) = \max_c\ {\chi}^2(c,t)\)

      • Strongest dependency between a category and a term
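A sketch of the \(\chi^2\) score from the \(2 \times 2\) term/class contingency table, with the max-over-categories variant above (function names are illustrative):

```python
def chi2(term, cls, docs, labels):
    """chi^2(c, t) from the 2x2 term/class contingency table."""
    a = b = c = d = 0  # a: t & cls, b: t & ~cls, c: ~t & cls, d: ~t & ~cls
    for doc, y in zip(docs, labels):
        has_t, in_c = term in doc, y == cls
        if has_t and in_c:
            a += 1
        elif has_t:
            b += 1
        elif in_c:
            c += 1
        else:
            d += 1
    n, denom = a + b + c + d, (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def chi2_max(term, docs, labels):
    """Strongest dependency between the term and any single category."""
    return max(chi2(term, cls, docs, labels) for cls in set(labels))

docs = [{"goal"}, {"goal"}, {"election"}, {"election"}]
labels = ["sports", "sports", "politics", "politics"]
score = chi2_max("goal", docs, labels)
```

Replacing `max` with a \(p(c)\)-weighted sum gives the expectation variant.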

Other metrics

  • Many other metrics (Same trick as in \(\chi^2\) statistics for multi-class cases)

    • Mutual information

      • Relatedness between term \(t\) and class \(c\)

    \[PMI(t;c) = \log\frac{p(t,c)}{p(t)p(c)}\]

    • Odds ratio

      • Odds of term \(t\) occurring with class \(c\) normalized by that without \(c\)

\[Odds(t;c) = \frac{p(t|c)}{1 - p(t|c)} \times \frac{1 - p(t|\bar{c})}{p(t|\bar{c})}\]
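Pointwise mutual information can be estimated directly from corpus counts; a minimal sketch using base-2 logs and a small epsilon to guard against zero counts (names are illustrative):

```python
import math

def pmi(term, cls, docs, labels, eps=1e-12):
    """log2 of p(t, c) / (p(t) p(c)), estimated from corpus counts."""
    n = len(docs)
    p_t = sum(term in d for d in docs) / n
    p_c = sum(y == cls for y in labels) / n
    p_tc = sum(term in d and y == cls for d, y in zip(docs, labels)) / n
    return math.log2((p_tc + eps) / (p_t * p_c + eps))

docs = [{"touchdown"}, {"touchdown"}, {"senate"}, {"senate"}]
labels = ["sports", "sports", "politics", "politics"]
score = pmi("touchdown", "sports", docs, labels)  # positive association
```

A positive score means the term occurs with the class more often than chance; zero means independence.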

Embedded Methods

Formalism

  • Many learning algorithms are cast into a minimization of some regularized functional:

\[\min_\alpha{\hat{R}(\alpha, \sigma)} = \min_\alpha{\sum_{k=1}^m{L(f(\alpha, \sigma \circ x_k), y_k)} + \Omega(\alpha)}\]

Embedded method

  • Embedded methods are a good inspiration to design new feature selection techniques for your own algorithms:
    • Find a functional that represents your prior knowledge about what a good model is.
    • Add the \(\sigma\) scaling weights into the functional and make sure it is either differentiable or that you can perform a sensitivity analysis efficiently
    • Optimize alternately with respect to \(\alpha\) and \(\sigma\)
    • Use early stopping (validation set) or your own stopping criterion to stop and select the subset of features
  • Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass, regression, etc…

Lasso vs Ridge

The \(l_1\) SVM

  • A version of SVM where \(\Omega(w) = ||w||^2\) is replaced by the \(l_1\) norm \(\Omega(w) = \sum_i{|w_i|}\)
  • Can be considered an embedded feature selection method:
    • Some weights will be drawn to zero (tend to remove redundant features)
    • Difference from the regular SVM where redundant features are included
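To illustrate the sparsifying effect of the \(l_1\) penalty, here is a toy subgradient sketch (not a production SVM solver): a hinge-loss gradient step followed by a soft-threshold step that pulls small weights exactly to zero. All names and hyperparameters are illustrative.

```python
import numpy as np

def l1_svm(X, y, lam=0.05, lr=0.01, epochs=1000):
    """Toy l1-regularized linear SVM via subgradient descent + soft-thresholding."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1  # points violating the margin drive the hinge gradient
        grad = -(X[active] * y[active, None]).sum(axis=0) / len(y)
        w -= lr * grad
        # l1 shrinkage: weights smaller than lr*lam are set exactly to zero
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

# feature 0 determines the label, feature 1 is noise
X = np.array([[2.0, 0.1], [1.5, -0.2], [-2.0, 0.15], [-1.0, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = l1_svm(X, y)  # expect a large weight on feature 0, ~0 on feature 1
```

With the squared \(l_2\) penalty of the regular SVM, the noise weight would merely shrink; the \(l_1\) soft-threshold drives it to exactly zero, which is what makes this an embedded feature selection method.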

The \(l_0\) SVM

  • Replace the regularizer \(||w||^2\) by the \(l_0\) norm \(\sum_{i=1}^n{1_{w_i \neq 0}}\)

  • Further replace \(\sum_{i=1}^n{1_{w_i \neq 0}}\) by \(\sum_i{log{(\epsilon + |w_i|)}}\)

  • Boils down to the following multiplicative update algorithm:

    1. Set \(\sigma = (1, ..., 1)\)
    2. Get \(w^*\) solution of an SVM on data set where each input is scaled by \(\sigma\)
    3. Set \(\sigma = |w^*| \circ \sigma\)
    4. Go back to step 2 until convergence

PCA

Feature selection vs feature reduction

  • Feature Selection seeks a subset of the \(n\) original features which retains most of the relevant information
    • Wrappers (e.g. forward selection), Filters (e.g. PMI), Embedded (e.g. Lasso, regularized SVM)
  • Feature Reduction combines/fuses the \(n\) original features into a smaller set of newly created features which hopefully retains most of the relevant information from all the original features
    • Data fusion (e.g. LDA, PCA, etc.)

PCA: Principal Component Analysis

  • PCA is one of the most common feature reduction techniques

  • A linear method for dimensionality reduction

  • Allows us to combine much of the information contained in \(n\) features into \(p\) features where \(p < n\)

  • PCA is unsupervised in that it does not consider the output class/value of an instance – There are other algorithms which do (e.g. Linear Discriminant Analysis)

  • PCA works well in many cases where data have mostly linear correlations

PCA overview

  • Seek new set of bases which correspond to the highest variance in the data
  • Transform \(n\)-dimensional normalized data to a new \(n\)-dimensional basis
    • The new dimension with the most variance is the first principal component
    • The next is the second principal component, etc.
    • Note \(z_1\) combines/fuses significant information from both \(x_1\) and \(x_2\)
  • Drop dimensions for which there is little variance

Variance and covariance

  • Variance is a measure of data spread in one dimension (feature)
  • Covariance measures how two dimensions (features) vary with respect to each other

\[var(X) = \frac{\sum_{i = 1}^n{(X_i - \bar X)(X_i - \bar X)}}{(n - 1)}\] \[cov(X,Y) = \frac{\sum_{i = 1}^n{(X_i - \bar X)(Y_i - \bar Y)}}{(n - 1)}\]
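A quick numerical check of the two formulas, using the sample \((n-1)\) denominator; `np.cov` builds the full covariance matrix, whose diagonal holds the variances.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# sample variance and covariance with the (n - 1) denominator, as above
var_x = ((x - x.mean()) ** 2).sum() / (len(x) - 1)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

C = np.cov(x, y)  # 2x2 covariance matrix; C[0, 0] = var(x), C[0, 1] = cov(x, y)
```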

Covariance and the covariance matrix

  • Considering the sign (rather than exact value) of covariance:
    • Positive value means that as one feature increases or decreases the other does also (positively correlated)
    • Negative value means that as one feature increases the other decreases and vice versa (negatively correlated)
    • A value close to zero means the features are uncorrelated (not necessarily independent)
    • If highly covariant, are both features necessary?
  • Covariance matrix is an \(n \times n\) matrix containing the covariance values for all pairs of features in a data set with \(n\) features (dimensions)
  • The diagonal contains the covariance of a feature with itself which is the variance (which is the square of the standard deviation)
  • The matrix is symmetric

PCA example

  • First step is to center the original data around 0 by subtracting the mean in each dimension

PCA example

  • Second: Calculate the covariance matrix of the centered data
  • Only \(2 \times 2\) for this case

PCA example

  • Third: Calculate the unit eigenvectors and eigenvalues of the covariance matrix (remember linear algebra)
    • Covariance matrix is always square \(n \times n\) and positive semi-definite, thus \(n\) non-negative eigenvalues will exist
    • All eigenvectors (principal components) are orthogonal to each other and form the new set of bases/dimensions for the data (columns)
    • The magnitude of each eigenvalue corresponds to the variance along that new dimension – Just what we wanted!
    • We can sort the principal components according to their eigenvalues
    • Just keep those dimensions with the largest eigenvalues

PCA example

  • Below are the two eigenvectors overlaying the centered data
  • Which eigenvector has the largest eigenvalue?
  • Fourth Step: Just keep the \(p\) eigenvectors with the largest eigenvalues
    • Do lose some information, but if we just drop dimensions with small eigenvalues then we lose only a little information, hopefully noise
    • We can then have \(p\) input features rather than \(n\)
    • The \(p\) features contain the most pertinent combined information from all \(n\) original features
    • How many dimensions \(p\) should we keep?

PCA example

  • Last Step: Transform the \(n\) features to the \(p\) (\(< n\)) chosen bases (Eigenvectors)
  • Transformed data (m instances) is a matrix multiply \(T = A \times B\)
    • \(A\) is a \(p \times n\) matrix with the \(p\) principal components in the rows, component one on top
    • \(B\) is a \(n \times m\) matrix containing the transposed centered original data set
    • \(T^T\) is a \(m \times p\) matrix containing the transformed data set
  • Now we have the new transformed data set with dimensionality \(p\)
  • Keep matrix \(A\) to transform future centered data instances
  • Below is shown the transform of both dimensions. What if we just kept the \(1^{st}\) component?

PCA algorithm summary

  1. Center the \(n\) normalized TS (training set) features (subtract the \(n\) means)
  2. Calculate the covariance matrix of the centered TS
  3. Calculate the unit eigenvectors and eigenvalues of the covariance matrix
  4. Keep the \(p\) (\(< n\)) eigenvectors with the largest eigenvalues
  5. Matrix-multiply the \(p\) eigenvectors with the centered TS to get a new TS with only \(p\) features
  • Given a novel instance during execution
    1. Center the normalized instance (subtract the \(n\) means)
    2. Do the matrix multiply (step 5 above) to change the new instance from \(n\) to \(p\) features
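The five steps above fit in a short numpy sketch (`pca_fit_transform` is an illustrative name; data is rows = instances, columns = features):

```python
import numpy as np

def pca_fit_transform(X, p):
    """Steps 1-5: center, covariance, eigendecompose, keep top p, project."""
    mu = X.mean(axis=0)
    Xc = X - mu                         # 1. center each feature
    C = np.cov(Xc, rowvar=False)        # 2. covariance matrix of the centered data
    vals, vecs = np.linalg.eigh(C)      # 3. eigenvalues/vectors (symmetric matrix)
    order = np.argsort(vals)[::-1]      # sort components by decreasing variance
    A = vecs[:, order[:p]]              # 4. keep the p principal components
    return Xc @ A, A, mu                # 5. project; keep A, mu for new instances

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
T, A, mu = pca_fit_transform(X, 1)

# a novel instance is centered with the stored means, then projected with A
t_new = (np.array([2.0, 2.0]) - mu) @ A
```

By construction the first component carries at least as much variance as any single original feature, which is why dropping the low-eigenvalue dimensions loses little information.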

Summary

Summary

  • Feature selection for text
  • Different methods
  • Can be quite effective!

Time for Practical 3!